Qianli Ma
I have extensive engineering experience across both Machine Learning Systems (MLSys) and Large Language Model algorithms.
My goal is to advance next-generation AGI systems and build larger, better models.
I am deeply passionate about the latest technologies and am a core contributor to several popular
open-source AI projects.
EDUCATION BACKGROUND
National University of Singapore
2022.8 - 2024.1
 Master of Computer Science
Singapore
- Focused on Machine Learning Systems in the HPC-AI Lab
- Dissertation: Maximizing Parallelism in Distributed Training for Diffusion Models
Zhejiang University
2018.9 - 2022.7
 B.Eng in Electronic Science and Technology
Hangzhou, China
- Shannon Elite Class, College of Information Science and Electronic Engineering
- Minor in the Intensive Training Program of Innovation and Entrepreneurship (ITP), Chu Kechen Honors College
- Honors: First Class Scholarship of Zhejiang University, Excellent Graduate of Zhejiang University
Other Honors: Second Prize, National High School Mathematics Competition (2017)
WORK EXPERIENCE
ByteDance  Seed
2023.12 - Present
 Senior Research Engineer
Shanghai
As one of the earliest members of the Seed team, I focus on AI infrastructure, optimizing training performance for LLMs and multimodal understanding and generation models (on thousands of GPUs),
from pre-training to post-training.
In particular, I led a small team to develop VeOmni, an open-source multimodal training system,
and was deeply involved in the research and development of core models such as Seed1.5-Thinking and UI-TARS.
- VeOmni is a PyTorch-native training framework purpose-built for both multimodal pre-training and post-training.
It natively supports DeviceMesh and DTensor and integrates cutting-edge features such as FSDP2, expert parallelism, and sequence parallelism (see the sketch after this list).
Within Seed, I leveraged VeOmni to build training infrastructure supporting diverse initiatives, including UI-TARS, model-architecture exploration, and unified generation-understanding model research.
- veScale is a PyTorch-native LLM training framework with DTensor-based nD parallelism and eager-mode execution.
- verl is a flexible, efficient and production-ready RL training library for large language models (LLMs).
- UI-TARS is an open-source multimodal agent built upon a powerful vision-language model, capable of effectively performing diverse tasks within virtual worlds.
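A minimal sketch of the DeviceMesh + FSDP2 pattern that VeOmni builds on, in standard PyTorch (illustrative only, not VeOmni's actual API; the mesh shape and model are placeholders):

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import fully_shard  # FSDP2, PyTorch >= 2.6

    # Assumes the process group is already initialized (e.g. via torchrun).
    # 2-D mesh: replicate across 2 groups, shard parameters across 4 ranks each.
    mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

    model = nn.TransformerEncoderLayer(d_model=1024, nhead=16)
    # fully_shard converts the module's parameters into DTensors laid out
    # over the mesh, giving FSDP2-style sharded data parallelism.
    fully_shard(model, mesh=mesh)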
ByteDance  AML
2023.6 - 2023.12
LLM Research Intern at Seed Project
Shanghai
Conducted research on machine learning systems, Process-Supervised Reward Models (PRM),
SFT data selection, and agents for data analysis.
- PRM: Built a complete pipeline for data processing, PRM training, and evaluation,
and proposed a heuristic greedy search algorithm based on Process-Supervised Reward Models (HGS-PRM),
which uses step-level feedback from the PRM to optimize the reasoning paths of large language models (see the search sketch after this list);
compared with Chain-of-Thought (CoT), it improves the model's mathematical reasoning and code generation.
- SFT Data Selection: Developed DavIR, a model-centric data selection method that enables LLMs (LLaMA, Gemma) to outperform full-dataset training with only 6% of the Alpaca data; extended it to DavIR-DPO,
boosting Zephyr-7B-SFT's alignment performance by 8% on AlpacaEval.
- Agent for Data Analysis: Built InfiAgent-DABench, a benchmark for evaluating agents on data analysis tasks.
Developed agent infrastructure components such as LLM API integration, vLLM-powered inference engines, and Python sandboxes, as well as model training infrastructure.
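A simplified sketch of the HGS-PRM idea, greedy step-by-step search guided by a process reward model (generate_candidates and prm_score are hypothetical stand-ins, not the paper's released code):

    def hgs_prm_search(problem, generate_candidates, prm_score,
                       max_steps=10, num_candidates=8):
        # generate_candidates(problem, steps, n): sample n next reasoning steps
        # prm_score(problem, steps): step-level PRM reward for a partial path
        steps = []
        for _ in range(max_steps):
            candidates = generate_candidates(problem, steps, num_candidates)
            # Greedily keep the candidate whose extended path the PRM scores highest.
            best = max(candidates, key=lambda s: prm_score(problem, steps + [s]))
            steps.append(best)
            if "final answer" in best.lower():  # illustrative stop condition
                break
        return steps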
HPC-AI Technology
2022.7 - 2023.5
Machine Learning Systems Engineer
Singapore
Joined as employee #15, completing my master's degree while supporting the company's growth from seed stage to Series A.
As a key developer on ColossalAI, the company's core training framework,
I also led projects including ColossalChat and ColoDiffusion,
driving GitHub stars from 0 to 20k+.
Beyond R&D, I contributed to commercialization strategy, grew the open-source community,
and participated in cloud product design.
- ColossalChat
- Took on the core development of the training code for Coati (ColossalAI Talking Intelligence), a large language model, and designed the entire training pipeline, including instruction data collection, data preprocessing, distributed training and acceleration, and model alignment tuning.
We also open-sourced the Coati7B and Coati13B large language models.
- After Coati was open-sourced, ColossalAI ranked first on the GitHub trending list for three consecutive days (eventually overtaken by The Algorithm,
Twitter's open-source project released by Elon Musk).
This had a huge impact on the community, earned ColossalAI more than 10k additional stars,
and made it one of the fastest-growing AI open-source projects in Q1 2023.
- ColossalAI: A Unified Deep Learning System for the Big Model Era
- Developed core features of ColossalAI, including heterogeneous memory management, pipeline parallelism, and distributed model saving.
- Participated in developing and supporting ColossalAI as a distributed backend for PyTorch Lightning, enabling easier integration between the two frameworks.
- Led the development of the AIGC big-model training solution ColoDiffusion:
- As the core developer, built a diffusion training framework based on PyTorch Lightning + ColossalAI that supports multiple training modes and was officially reposted by PyTorch.
- Used the ZeRO optimizer, auto chunking, FlashAttention, CPU offload, and other techniques to break the memory wall and support accelerated large-batch training.
- FastFold (Optimizing AlphaFold Training and Inference on GPU Clusters)
Added Ray-based parallel data preprocessing (tripling its speed), solving the core bottleneck of MSA feature search and long preprocessing times for training and inference (see the Ray sketch after this list).
- Technology Stack: Python, C++, CUDA, PyTorch, Ray, ColossalAI, PyTorch Lightning, TensorRT, DeepSpeed, Hugging Face
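A minimal sketch of the Ray pattern used to parallelize FastFold's preprocessing (run_msa_and_featurize and load_sequences are hypothetical placeholders; the real pipeline is more involved):

    import ray

    ray.init()

    @ray.remote
    def build_features(sequence):
        # CPU-heavy MSA search / feature construction for one sequence
        # (run_msa_and_featurize is a hypothetical stand-in).
        return run_msa_and_featurize(sequence)

    sequences = load_sequences()  # hypothetical input list
    # Fan the per-sequence work out across all available workers
    # instead of processing sequences one at a time.
    futures = [build_features.remote(seq) for seq in sequences]
    features = ray.get(futures)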
SenseTime   Large Model Training
2021.12 - 2022.6
AI Research Intern
Hangzhou
I participated in the development of SenseTime Spring, SenseTime's large-scale distributed machine learning training framework, and in research related to machine learning systems
- Brought large object-detection models (Vision Transformer, Swin Transformer, etc.) to production; supported SenseTime's general detection framework POD with PyTorch distributed data-parallel training and mixed-precision training (see the DDP sketch after this list)
- Involved in MLOps work and machine learning cloud-platform development, supporting a model lifecycle management database
- Technology Stack: Python, C++, CUDA, PyTorch, Go, Nebula DB
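A generic sketch of the DDP + mixed-precision pattern referenced above, in plain PyTorch (not SenseTime's internal POD code; build_model, loader, and optimizer are placeholders):

    import os
    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    torch.distributed.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    model = DDP(build_model().cuda(rank), device_ids=[rank])
    scaler = torch.amp.GradScaler("cuda")

    for images, targets in loader:
        optimizer.zero_grad()
        # Run the forward pass in reduced precision where safe.
        with torch.amp.autocast("cuda"):
            # Placeholder semantics: the model returns the loss for this batch.
            loss = model(images.cuda(rank), targets.cuda(rank))
        # Scale the loss to avoid fp16 gradient underflow, then unscale and step.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()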
Huawei 2012 Lab   Distributed Parallel Lab
2021.7 - 2021.12
AI Engineering Intern
Hangzhou
I contributed to MindSpore, a full-scenario deep learning framework, and developed three new features for MindSpore Lite
- Completed the core code for OpenGL texture conversion in MindSpore Lite, shipped as a key feature of MindSpore Lite 1.6
- Implemented OpenCL backend support for MindSpore Lite on the x86 platform and developed GPU operators for MindSpore core
- Implemented and iterated MindSpore's logging system on the x86 platform based on glog
- Technology Stack: C++, OpenCL, OpenGL, CMake, Python
PUBLICATIONS
- Hu, X., Ma, Q., et al. (2024). InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. ICML 2024 (published). [Paper]
- Ma, Q., Zhou, H., et al. (2023). Let's Reward Step by Step: Step-Level Reward Model as the Navigators for Reasoning. [Paper]
- Zhou, H., Liu, T., Ma, Q., et al. (2023). DavIR: Data Selection via Implicit Reward for Large Language Models. ACL 2025 (published). [Paper]
- Qin, Y., ..., Ma, Q., ..., Shi, G. (2025). UI-TARS: Pioneering Automated GUI Interaction with Native Agents. [Paper]
- Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
KNOWLEDGE & SKILLS
- Programming Languages: Python, C/C++, Golang, JavaScript
- MLSys Full Stack: ColossalAI, VeOmni, DeepSpeed, Ray, Megatron-LM, verl, vLLM, transformers, Triton
- LLM Full Stack: Pre-training, SFT, RLHF, RLVR, Vision-Language Models, Omni Models, Agentic
- Tools: Linux, Vim, Shell, Git, Docker, CMake
CLUBS & ORGANISATIONAL EXPERIENCE
Zhejiang University Internet Society   Technology Department   AI Lab
2021.10 - 2022.8
String Program   Technology Department   Member of the Machine Learning Subdepartment
2020.7 - Present
Zhejiang University Electroacoustic Orchestra   Drummer, Six O'Clock Studio Band
2018.11 - 2021.2